GPs are fantastic, but they are not without their drawbacks. Computational expense is one.
Flexibility is another.
In this segment we’ll try to address those issues, ideally simultaneously.
Inroads into faster GP modeling are being made from a number of angles, usually by approximation.
The trouble is, not many of them come with software.
But there are a few underlying themes, with representative software for each.
And in fact all three can be seen as mechanisms for inducing sparsity in the covariance structure.
It is worth noting that you can’t just truncate small entries of \(K_{ij}\) to zero, because the resulting matrix is generally no longer positive definite.
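To see why, here is a little demo of my own (not from the text): zeroing out the small entries of an exponential-kernel covariance matrix leaves a matrix that is no longer positive definite.

```r
## Hypothetical demo: naive truncation of small covariance entries
## destroys positive definiteness.
x <- seq(0, 10, length=200)
D <- as.matrix(dist(x))                 ## pairwise distances
K <- exp(-D)                            ## exponential kernel: positive definite
Kt <- K * (K > 0.4)                     ## zero out the "small" entries
min(eigen(K, symmetric=TRUE, only.values=TRUE)$values)    ## positive
min(eigen(Kt, symmetric=TRUE, only.values=TRUE)$values)   ## negative: not a valid covariance
```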
A kernel \(k_{r_\max}(r)\), where \(r = |x - x'|\), is said to have compact support if \(k_{r_\max}(r) = 0\) when \(r > r_\max\).
A compactly supported kernel (CSK) introduces zeros into the covariance matrix while preserving positive definiteness.
Two families of CSKs, the Bohman and truncated power, offer decent approximations to the power exponential family.
These functions are zero for \(r \geq r_\max\), and for \(r < r_\max\):
\[ \begin{aligned} k^\mathrm{B}_{r_\max}(r) &= \left(1-\frac{r}{r_\max}\right) \cos \left(\frac{\pi r}{r_\max}\right) + \frac{1}{\pi}\sin\left(\frac{\pi r}{r_\max}\right) \\ k^\mathrm{tp}_{r_\max}(r; \alpha, \nu) &= [1 - (r/r_\max)^\alpha]^\nu, \quad \mbox{ where } 0 < \alpha < 2 \mbox{ and } \nu \geq \nu_d(\alpha). \end{aligned} \]
Notice that \(r_\max\) plays a dual role here, controlling both lengthscale and degree of sparsity.
Let’s code one of these up and see how it works.
kB <- function(r, rmax)
{
  rnorm <- r/rmax   ## local variable; shadows stats::rnorm only inside this function
  k <- (1 - rnorm) * cos(pi*rnorm) + sin(pi*rnorm)/pi
  return(k * (r < rmax))
}
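A companion sketch (my own code, following the formula above) for the truncated power CSK; validity requires \(0 < \alpha < 2\) and \(\nu \geq \nu_d(\alpha)\), which we take as given for the illustrative values below.

```r
## Truncated power CSK, following the formula above; alpha=1.5 and nu=3
## are illustrative choices assumed to satisfy nu >= nu_d(alpha).
ktp <- function(r, rmax, alpha=1.5, nu=3)
{
  k <- (1 - (r/rmax)^alpha)^nu
  return(k * (r < rmax))
}
ktp(0, 1)   ## 1 at the origin
ktp(2, 1)   ## 0 beyond rmax
```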
First calculate some distances …
library(plgp)
X <- matrix(seq(0, 10, length=2000), ncol=1)
D <- distance(X)
… and then feed them into \(k^B_{r_\max}(\cdot)\).
Seems to work …
K1 <- kB(sqrt(D), 1)
mean(K1 > 0)
## [1] 0.18955
K2 <- kB(sqrt(D), 2)
mean(K2 > 0)
## [1] 0.3596
K025 <- kB(sqrt(D), 0.25)
mean(K025 > 0)
## [1] 0.0488875
We’ll need to invest in a sparse matrix library, e.g., spam or Matrix.
library(Matrix)
eps <- sqrt(.Machine$double.eps)   ## jitter, (re-)defined here in case it isn't in scope
K <- exp(-D) + diag(eps, nrow(D))
rbind(system.time(chol(K)),
system.time(chol(Matrix(K1, sparse=TRUE))),
system.time(chol(Matrix(K2, sparse=TRUE))),
system.time(chol(Matrix(K025, sparse=TRUE))))
## user.self sys.self elapsed user.child sys.child
## [1,] 1.076 0.000 1.079 0 0
## [2,] 0.116 0.000 0.119 0 0
## [3,] 0.244 0.020 0.261 0 0
## [4,] 0.068 0.004 0.073 0 0
We want to encourage sparsity because that means speed, but what does it cost us in accuracy?
Consider a simple 1-d process, observed on a grid.
x <- c(1,2,4,5,6,8,9,10)/11; n <- length(x)
D <- distance(as.matrix(x))
K <- exp(-5*sqrt(D)^1.5) + diag(eps, n); Ki <- solve(K)
library(mvtnorm); y <- t(rmvnorm(1, sigma=K))
Here is the prediction from the “ideal fit” to that data.
xx <- seq(0, 1, length=100)
DX <- distance(as.matrix(x), as.matrix(xx))
KX <- exp(-5*sqrt(DX)^1.5)
m <- t(KX) %*% Ki %*% y; Sigma <- diag(1+eps, ncol(KX)) - t(KX) %*% Ki %*% KX
q1 <- qnorm(0.05, m, sqrt(diag(Sigma))); q2 <- qnorm(0.95, m, sqrt(diag(Sigma)))
plot(x, y, ylim=c(-2,2))
lines(xx, m); lines(xx, q1, lty=2); lines(xx, q2, lty=2)
Consider the Bohman CSK with \(r_\max = 0.1\).
K1 <- kB(sqrt(D), 0.1)
KX1 <- kB(sqrt(DX), 0.1)
Ki1 <- solve(K1)
m1 <- t(KX1) %*% Ki1 %*% y
tau2 <- drop(t(y) %*% Ki1 %*% y)/n
Sigma1 <- tau2 * (1 - t(KX1) %*% Ki1 %*% KX1)
q11 <- qnorm(0.05, m1, sqrt(diag(Sigma1)))
q21 <- qnorm(0.95, m1, sqrt(diag(Sigma1)))
Too wiggly and too wide.
plot(x, y, ylim=c(-2,2)); lines(xx, m); lines(xx, q1, lty=2); lines(xx, q2, lty=2)
lines(xx, m1, col=2); lines(xx, q11, col=2, lty=2); lines(xx, q21, col=2, lty=2)
The mean is off because aggregation transpires in a narrower window, and variance is larger simply because sparse \(K\) leads to larger \(K^{-1}\).
Kaufman et al. (2011) propose augmenting with a rich nonlinear mean function to “mop up” the long-range non-linearity, leaving only short-range structure for the CSK to absorb.
Basis expansion is a good way to define a rich non-linear mean structure.
Kaufman et al. found that Legendre polynomials (in coded inputs) work well.
library(SparseEm)
leg01 <- legFun(0, 1)
degree <- 4
X <- leg01(x, terms=polySet(1, degree, 2, degree))
colnames(X) <- paste("l", 0:(ncol(X)-1), sep="")
print(X <- data.frame(X))
## l0 l1 l2 l3 l4
## 1 1 -1.4171325 1.1272739 -0.3756927 -0.5243494
## 2 1 -1.1022142 0.2402387 0.8209581 -1.2783963
## 3 1 -0.4723775 -0.8685553 0.9481768 0.3608360
## 4 1 -0.1574592 -1.0903141 0.3558148 1.0329212
## 5 1 0.1574592 -1.0903141 -0.3558148 1.0329212
## 6 1 0.7872958 -0.4250377 -1.1827363 -0.6390957
## 7 1 1.1022142 0.2402387 -0.8209581 -1.2783963
## 8 1 1.4171325 1.1272739 0.3756927 -0.5243494
First, let’s take a look at how well the basis expansion works on its own.
lfit <- lm(y~.-1, data=X)
XX <- leg01(xx, terms=polySet(1, degree, 2, degree))
colnames(XX) <- paste("l", 0:(ncol(X)-1), sep="")
p <- predict(lfit, newdata=data.frame(XX), interval="prediction", level=0.9)
(Try increasing the degree to 5 or 6; what happens?)
Maybe a better fit than the CSK GP, but over-smoothed compared to the ideal GP.
plot(x, y, ylim=c(-2,2)); lines(xx,m); lines(xx, q1, lty=2); lines(xx, q2, lty=2)
lines(xx, m1, col=2); lines(xx, q11, col=2, lty=2); lines(xx, q21, col=2, lty=2)
lines(xx,p[,1],col=3); lines(xx,p[,2],col=3,lty=2); lines(xx,p[,3],col=3,lty=2)
How about if we combine the rich mean function with CSK on the residuals?
Here is a CSK fit to the residuals from the process above.
m2 <- t(KX1) %*% Ki1 %*% lfit$resid
tau22 <- drop(t(lfit$resid) %*% Ki1 %*% lfit$resid)/n
Sigma2 <- tau22 * (1 - t(KX1) %*% Ki1 %*% KX1)
Predicting by combining the means, and taking variance estimates from CSK.
p <- predict(lfit, newdata=data.frame(XX), interval="prediction", level=0.9)
m2 <- p[,1] + m2
q12 <- qnorm(0.05, m2, sqrt(diag(Sigma2)))
q22 <- qnorm(0.95, m2, sqrt(diag(Sigma2)))
Much better on mean and variance.
plot(x, y, ylim=c(-2,2)); lines(xx,m); lines(xx, q1, lty=2); lines(xx, q2, lty=2)
lines(xx, m1, col=2); lines(xx, q11, col=2, lty=2); lines(xx, q21, col=2, lty=2)
lines(xx,m2,col=4); lines(xx,q12,col=4,lty=2); lines(xx,q22,col=4,lty=2)
Now, this illustration does not provide a full accounting of the uncertainty.
We used the true values of the correlation function as plug-ins for the kernel of the residual process, rather than estimating them.
The only thing we estimated was the scale \(\hat{\tau}^2\) …
c(tau2, tau22)
## [1] 0.12072829 0.03106743
Kaufman, et al. describe a prior distribution linking together the \(r_{\max,j}\) for each input dimension \(j\),
to produce a covariance matrix with a certain degree of sparsity. E.g.,
\[ r_\max \mbox{ uniform in } R_C = \left\{ r_\max \in \mathbb{R}^d : r_{\max,j} \geq 0, \sum_{j=1}^d r_{\max,j} \leq C \right\}. \]
This allows some of the \(r_{\max,j}\) to be large, reflecting a high degree of correlation in particular input directions, while others remain small to preserve sparsity.
This is paired with the usual reference priors for scale (\(\tau^2)\) and regression coefficient (\(\beta\)) parameters for the Legendre basis.
Kaufman, et al., provide a numerical procedure for estimating the \(C\) that would yield a desired level of sparsity.
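As a sketch of what draws from that prior look like, here is one way (my own code, not Kaufman et al.’s) to sample \(r_\max\) uniformly over \(R_C\), using the fact that normalized iid exponentials are uniformly distributed on the simplex.

```r
## Draw r_max uniformly over R_C = { r >= 0 : sum(r) <= C } via
## the exponential-spacings (Dirichlet) construction.
rRC <- function(d, C)
{
  g <- rexp(d + 1)           ## d+1 iid exponentials
  C * g[1:d] / sum(g)        ## first d normalized coords: uniform on the solid simplex
}
set.seed(1)
r <- rRC(d=8, C=2.7)
c(min=min(r), sum=sum(r))    ## nonnegative coordinates, summing to less than C
```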
Inference is carried out by a typical mixture of Metropolis and Gibbs steps.
Implementation detail: Kaufman, et al., recommend Legendre polynomials up to degree 5 in a “tensor product” form for their motivating cosmology example.
Consider the borehole data, which is a classic computer experiments example (Morris, et al., 1993).
It is a function of eight inputs, modeling water flow through a borehole.
\[ y = \frac{2\pi T_u [H_u - H_l]}{\log\left(\frac{r}{r_w}\right) \left[1 + \frac{2 L T_u}{\log (r/r_w) r_w^2 K_w} + \frac{T_u}{T_l} \right]}. \]
The input ranges are
\[ \begin{aligned} r_w &\in [0.05, 0.15] & r &\in [100,50000] & T_u &\in [63070, 115600] \\ T_l &\in [63.1, 116] & H_u &\in [990, 1110] & H_l &\in [700, 820] \\ L &\in [1120, 1680] & K_w &\in [9855, 12045]. \end{aligned} \]
Here is an implementation in coded inputs.
borehole <- function(x){
rw <- x[1] * (0.15 - 0.05) + 0.05
r <- x[2] * (50000 - 100) + 100
Tu <- x[3] * (115600 - 63070) + 63070
Hu <- x[4] * (1110 - 990) + 990
Tl <- x[5] * (116 - 63.1) + 63.1
Hl <- x[6] * (820 - 700) + 700
L <- x[7] * (1680 - 1120) + 1120
Kw <- x[8] * (12045 - 9855) + 9855
m1 <- 2 * pi * Tu * (Hu - Hl)
m2 <- log(r / rw)
m3 <- 1 + 2*L*Tu/(m2*rw^2*Kw) + Tu/Tl
return(m1/m2/m3)
}
Here is a LHS training and testing partition.
n <- 4000
npred <- 500
dim <- 8
library(lhs)
x <- randomLHS(n+npred, dim)
y <- apply(x, 1, borehole)
ypred.0 <- y[-(1:n)]; y <- y[1:n]
xpred <- x[-(1:n),]; x <- x[1:n,]
Then set up the model and find the \(C\) that gives a desired level of sparsity.
degree <- 2; maxint <- 2
sparsity <- 0.99
mc <- find.tau(den = 1 - sparsity, dim = ncol(x)) * ncol(x)
mc
## [1] 2.707658
Now, let’s gather \(B=2000\) samples from the posterior.
B <- 2000
suppressWarnings({
time1 <- system.time(
tau <- mcmc.sparse(y, x, mc=mc,degree=degree,maxint=maxint,B=B,verbose=FALSE))
})
And then make predictions on the testing set, after discarding burn-in and thinning.
burnin <- 500
index <- seq(burnin+1, B, by = 10)
suppressWarnings({
time2 <- system.time(ypred.sparse <-
pred.sparse(tau[index,], x,y,xpred,degree=degree,maxint=maxint,verbose=FALSE))
})
Pretty good on the convergence front.
par(mfrow=c(1,2)); matplot(tau, type = "l", xlab="iter")
plot(apply(tau, 1, sum), type = "l", xlab="iter")
Kaufman, et al. provide a non-sparse version for comparison.
This is going to be really slow, so we’ll do an order of magnitude fewer MCMC iterations.
B <- 200
suppressWarnings({
time3 <- system.time(phi <- mcmc.nonsparse(y, x, B=B, verbose=FALSE))
burnin <- 50
index <- seq(burnin+1, B, by = 10)
time4 <- system.time(ypred.nonsparse <-
pred.nonsparse(phi[index,], x, y, xpred, 2, verbose=FALSE))
})
Not converged, but that’s about all we can do in a reasonable amount of time.
matplot(phi, type = "l", xlab="iter")
In terms of time, about a \(3\times\) speedup.
print(times <- c(sparse=as.numeric(time1[3]+time2[3]),
dense=10*as.numeric(time3[3]+time4[3])))
## sparse dense
## 3590.044 9732.370
In terms of accuracy … (unfair to the dense version since it hasn’t burned in)
s2s <- ypred.sparse$var; s2n <- ypred.nonsparse$var
print(scores <- c(sparse=mean(- (ypred.sparse$mean - ypred.0)^2/s2s - log(s2s)),
dense=mean(- (ypred.nonsparse$mean - ypred.0)^2/s2n - log(s2n))))
## sparse dense
## -1.616263 -2.636446
Let’s see how much faster (and how much less accurate) a 99.9% sparse version is.
sparsity <- 0.999
mc <- find.tau(den = 1 - sparsity, dim = ncol(x)) * ncol(x)
B <- 2000
suppressWarnings({
time5 <- system.time(
tau <- mcmc.sparse(y, x, mc=mc,degree=degree,maxint=maxint,B=B,verbose=FALSE))
})
burnin <- 500
index <- seq(burnin+1, B, by = 10)
suppressWarnings({
time6 <- system.time(ypred.sparse <-
pred.sparse(tau[index,], x,y,xpred,degree=degree,maxint=maxint,verbose=FALSE))
})
Much faster.
times <- c(times, s999=as.numeric(time5[3]+time6[3]))
times
## sparse dense s999
## 3590.044 9732.370 281.875
A little less accurate.
s2 <- ypred.sparse$var
scores <- c(scores, s999=mean(- (ypred.sparse$mean - ypred.0)^2/s2 - log(s2)))
scores
## sparse dense s999
## -1.616263 -2.636446 -1.796645
Another way to induce sparsity in the covariance structure is to partition the input space, fitting independent GPs within each element.
The trouble is, it’s hard to know just how to split things up.
Once we’ve figured that out, it makes sense to fit hyperparameters independently in each partition,
… hard to do, especially with Gaussian processes.
I know of only two (successful) attempts.
Software is a whole different ball-game.
Tesselations are easy to characterize mathematically, but a nightmare computationally.
Trees are easy mathematically too, and far more tractable computationally.
The plan is to use GPs at the leaves, but let’s take a step back first …
Use of trees in regression dates back to AID (Automatic Interaction Detection) by Morgan and Sonquist (1963).
Classification and Regression Trees (CART) (Breiman, et al., 1984) popularized the idea.
The selling point was that
But fitting partition structure (depth, splits, etc.) isn’t easy. You need
And there are lots of ways to skin that cat.
I prefer the likelihood, whenever possible.
Given a particular tree, \(\mathcal{T}\), the (marginal) likelihood factorizes into a product form.
\[ p(y^n \mid \mathcal{T}, x^n) \equiv p(y_1, \dots, y_n \mid \mathcal{T}, x_1,\dots, x_n) = \prod_{\eta \in \mathcal{L}_\mathcal{T}} p(y^\eta \mid x^\eta) \]
The simplest leaf model for regression is the constant model with unknown mean and variance (\(\theta_\eta = (\mu_\eta, \sigma^2_\eta)\)):
\[ \begin{aligned} p(y^\eta \mid \mu_\eta, \sigma^2_\eta, x^\eta) &\propto \sigma_\eta^{-|\eta|} \exp\left\{\textstyle-\frac{1}{2\sigma_\eta^2} \sum_{y \in \eta} (y - \mu_\eta)^2\right\} \\ \mbox{so that } \;\; p(y^\eta \mid x^\eta) & = \frac{1}{(2\pi)^{\frac{|\eta|-1}{2}}}\frac{1}{\sqrt{|\eta|}} \left(\frac{s_\eta^2}{2}\right)^{-\frac{|\eta|-1}{2}} \Gamma\left(\frac{|\eta|-1}{2}\right) \end{aligned} \]
taking \(\pi(\mu_\eta, \sigma_\eta^2) \propto \sigma_\eta^{-2}\).
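That marginal is easy to code up; here is a direct transcription of mine, on the log scale for stability, with \(s_\eta^2\) taken as the sum of squared deviations about the leaf mean.

```r
## Log marginal likelihood of the constant leaf model above;
## s2 is the sum of squared deviations about the leaf mean.
leaf.lml <- function(y)
{
  n <- length(y)
  s2 <- sum((y - mean(y))^2)
  -(n-1)/2*log(2*pi) - log(n)/2 - (n-1)/2*log(s2/2) + lgamma((n-1)/2)
}
y <- c(-0.5, 0.1, 0.4, 1.2, 0.9)
leaf.lml(y)
```

Observe that tightly clustered responses in a leaf yield a higher (log) marginal, which is what drives split selection.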
Some kind of regularization is needed for inference,
There were two papers, published at almost the same time, proposing a so-called Bayesian CART model,
Denison, Mallick & Smith (1998) were looking for a “light touch”,
Chipman, George & McCulloch (1998) called for more structure
Trees may stochastically be grown from a leaf node \(\eta\) with a probability that depends on the depth \(D_\eta\) of that node in the tree, \(\mathcal{T}\).
\[ p_{\mathrm{split}}(\eta,\mathcal{T}) = \alpha(1+D_{\eta})^{-\beta} \]
This induces a prior for the full tree \(\mathcal{T}\) via the probability that internal nodes \(\mathcal{I}_\mathcal{T}\) split and leaves \(\mathcal{L}_\mathcal{T}\) do not:
\[ \pi(\mathcal{T}) \propto \prod_{\eta\,\in\,\mathcal{I}_\mathcal{T}} p_{\mathrm{split}}(\eta, \mathcal{T}) \prod_{\eta\,\in\,\mathcal{L}_\mathcal{T}} [1-p_{\mathrm{split}}(\eta, \mathcal{T})]. \]
As in the DMS prior, CGM retains uniformity on everything else:
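For intuition, here is a toy calculation (mine; the \(\alpha\) and \(\beta\) values are arbitrary) of the CGM prior, up to the uniform split-rule terms, for the simplest nontrivial tree: a root split with two leaf children.

```r
## p_split and the (unnormalized) CGM prior for a root split with two leaves;
## alpha=0.5 and beta=2 are arbitrary illustrative choices.
psplit <- function(depth, alpha=0.5, beta=2) alpha * (1 + depth)^(-beta)
prior <- psplit(0) * (1 - psplit(1))^2   ## root splits; depth-1 children do not
prior
```

Deeper nodes have smaller \(p_{\mathrm{split}}\), so the prior penalizes bushy trees.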
Inference then proceeds by MCMC.
Here is how the MCMC would go.
\[ \frac{p(\mathcal{T}' \mid y^n, x^n)}{p(\mathcal{T} \mid y^n, x^n)} = \frac{p(y^n \mid \mathcal{T}', x^n)}{p(y^n \mid \mathcal{T}, x^n)} \times \frac{\pi(\mathcal{T}')}{\pi(\mathcal{T})}. \]
Potential savings come with local moves in tree space.
What do tree proposals look like? Here is an example of the four most popular “tree moves”.
As an illustration, consider the motorcycle accident data in the MASS library for R.
library(MASS)
The tgp package will sample from the “Bayesian CART (BCART)” posterior.
library(tgp)
XX <- seq(0,max(mcycle[,1]), length=1000)
out.bcart <- bcart(X=mcycle[,1], Z=mcycle[,2], XX=XX, R=100, verb=0)
XX must be specified with the fitting call if you want predictions averaged over all MCMC iterations.
outp.bcart <- predict(out.bcart, XX=XX)
To visualize the predictive surface(s), I’m going to write a little macro that we can re-use in several variations later.
plot.moto <- function(out, outp)
{
plot(outp$XX[,1], outp$ZZ.km, ylab="accel", xlab="time",
ylim=c(-150, 80), lty=2, col=1, type="l")
points(mcycle)
lines(outp$XX[,1], outp$ZZ.km + 1.96*sqrt(outp$ZZ.ks2), col=2, lty=2)
lines(outp$XX[,1], outp$ZZ.km - 1.96*sqrt(outp$ZZ.ks2), col=2, lty=2)
lines(out$XX[,1], out$ZZ.mean, col=1, lwd=2)
lines(out$XX[,1], out$ZZ.q1, col=2, lwd=2)
lines(out$XX[,1], out$ZZ.q2, col=2, lwd=2)
}
The MCMC is good at smoothing out rough transitions.
plot.moto(out.bcart, outp.bcart)
What do we get?
The MAP estimate has many of those features,
Try tgp.trees(out.bcart) to visualize the best trees.
Any data type/leaf model may be used without extra computational effort if \(p(y^\eta \mid x^\eta)\) is analytic.
A so-called Bayesian treed linear model (BTLM) (Chipman, et al., 2002) uses
\[ \begin{aligned} p(y^\eta \mid \beta_\eta, \sigma_\eta^2, x^\eta) & \propto \sigma_\eta^{-|\eta|} \exp \left\{ -\frac{\|y^\eta - X^\eta \beta_\eta\|^2}{2\sigma_\eta^2}\right\} \;\; \mbox{ and } \;\; \pi(\beta_\eta, \sigma_\eta^2) \propto \sigma_\eta^{-2}. \end{aligned} \]
In that case we have
\[ p(y^\eta | x^{\eta}) = \frac{1}{(2\pi)^{\frac{|\eta|-d-1}{2}}} \left(\frac{|\mathcal{G}_\eta^{-1}|}{|\eta|}\right)^{\frac{1}{2}} \left(\frac{s_\eta^2-\mathcal{R}_\eta}{2}\right)^{-\frac{|\eta|-d-1}{2}} \Gamma\left(\frac{|\eta|-d-1}{2}\right), \]
where \(\mathcal{G}_\eta = \bar{X}_\eta^\top\bar{X}_\eta\), \(\mathcal{R}_\eta = \hat{\beta}_\eta^\top \mathcal{G}_\eta \hat{\beta}_\eta\) and \(d\)-column \(\bar{X}_\eta\) is a centered \(X_\eta\).
Here is the fit in tgp, again passing XX during the fitting stage for full posterior averaging in prediction.
out.btlm <- btlm(X=mcycle[,1], Z=mcycle[,2], XX=XX, R=100, verb=0)
outp.btlm <- predict(out.btlm, XX=XX)
We’ll use the plotting macro we made for easy visualization.
Fewer partitions, but is it better?
plot.moto(out.btlm, outp.btlm)
If the response is categorical,
Other members of the exponential family proceed similarly:
However, to my knowledge none of these choices have been actually implemented as leaf models in a Bayesian treed regression setting.
Technically, any leaf model can be deployed
An important exception is GPs.
GP leaves encourage shallow trees with fewer leaf nodes.
First, let’s look at a stationary GP fit to the motorcycle data.
Smooth, but just as bad as BCART and BTLM in other (opposite) respects?
out.bgp <- bgp(X=mcycle[,1], Z=mcycle[,2], XX=XX, R=10, verb=0)
outp.bgp <- predict(out.bgp, XX=XX)
plot.moto(out.bgp, outp.bgp)
Bayesian treed Gaussian process (TGP) models (Gramacy & Lee, 2008) can offer the best of both worlds, marrying the nonstationary flexibility of treed partitioning with the smoothness of GP leaves.
Their divide-and-conquer nature means smaller matrices to decompose, and thus faster inference, alongside that nonstationary flexibility.
Perversely, the two go hand in hand.
Same program as above, but with btgp. (Try verb=1 for a progress meter.)
out.btgp <- btgp(X=mcycle[,1], Z=mcycle[,2], XX=XX, R=30, bprior="b0", verb=0)
The argument bprior="b0" is optional.
By default, tgp fits a linear mean with the GP at the leaves (bprior="bflat").
Specifying bprior="b0" creates a hierarchical prior linking all the \(\beta_\eta\) and \(\sigma_\eta\), \(\eta \in \mathcal{L}_\mathcal{T}\), together.
As before, we can extract the MAP predictor, mostly for comparison.
outp.btgp <- predict(out.btgp, XX=XX)
Pretty darn good, if you ask me.
plot.moto(out.btgp, outp.btgp)
Even better with sometimes fully linear leaves (Gramacy & Lee, 2008).
out.btgpllm <- btgpllm(X=mcycle[,1],Z=mcycle[,2], XX=XX, R=30,bprior="b0",verb=0)
outp.btgpllm <- predict(out.btgpllm, XX=XX)
plot.moto(out.btgpllm, outp.btgpllm)
Let’s revisit the 2-d exponential data, via tgp’s exp2d.rand generator.
exp2d.data <- exp2d.rand(n1=25, n2=75)
X <- exp2d.data$X; Z <- exp2d.data$Z
XX <- exp2d.data$XX
First let’s fit an ordinary (Bayesian) GP, using the isotropic power exponential kernel via corr="exp".
out.bgp <- bgp(X=X, Z=Z, XX=XX, corr="exp", verb=0)
A stationary process means uniform uncertainty (in distance).
plot(out.bgp, pc="c")
A nonstationary process means we can learn that it is hard in the SW corner.
out.btgp <- btgp(X=X, Z=Z, XX=XX, corr="exp", R=10, verb=0)
plot(out.btgp, pc="c")
A clean treed partition of the input space.
tgp.trees(out.btgp, heights="map")
Treed GPs were invented for the rocket booster (LGBB) data.
lgbb.as <- read.table("lgbb/lgbb_as.txt", header=TRUE)
lgbb.rest <- read.table("lgbb/lgbb_as_rest.txt", header=TRUE)
Those files contain
Here we develop the “final predictive surface” after the sequential design effort, on the lift response.
X <- lgbb.as[,2:4]
Y <- lgbb.as$lift
XX <- lgbb.rest[,2:4]
c(X=nrow(X), XX=nrow(XX))
## X XX
## 780 37128
The fit is pretty computationally intensive, so we won’t do any restarts (with default R=1).
t1 <- system.time(fit <- btgpllm(X=X, Z=Y, XX=XX, bprior="b0", verb=0))
t1[3]
## elapsed
## 7925.457
We can inspect a 2-d slice of the posterior predictive surface, say for a side-slip-angle of zero.
plot(fit, slice=list(x=3, z=0), gridlen=c(100, 100), layout="surf", span=0.01)
A clean two-element partition.
tgp.trees(fit, heights="map")
Future samples will be in the sub-sonic regime.
plot(fit, slice=list(x=3,z=0), gridlen=c(100,100), layout="as",as="alm",span=0.01)
But yeah, that was pretty slow.
tgp contains a number of “switches” and “knobs” to help speed things up,
One way is to change the prior.
Another way is via the MCMC.
linburn=TRUE will “burn-in” the BTGP MCMC with a BTLM, and then “switch on” GPs at the leaves toward the end.
Not much impact on the linburn predictive mean surface.
t2 <- system.time(fit2 <-
btgpllm(X=X, Z=Y, XX=XX, bprior="b0", linburn=TRUE, verb=0))
plot(fit2, slice=list(x=3, z=0), gridlen=c(100, 100), layout="surf", span=0.01)
An order of magnitude faster.
c(full=t1[3], linburn=t2[3])
## full.elapsed linburn.elapsed
## 7925.457 718.723
Indeed, much of the space is plausibly piece-wise linear anyway.
We already talked about ALM; gathering those statistics at the XX locations comes for free.
ALC is available via Ds2X=TRUE, which samples expected reductions in variance at the XX locations.
Since the XX are also used as reference locations by default, this can be a tall order.
By default, tgp provides in-sample predictions, but those aren’t necessary for sequential design; they can be turned off with pred.n=FALSE.
Often simple leaf models, e.g., constant or linear, lead to great sequential designs.
The linburn option offers a nice compromise
Here is the ALM surface under linburn.
plot(fit2, slice=list(x=3,z=0), gridlen=c(100,100), layout="as", as="alm", span=0.01)
The argument improv=TRUE causes samples to be gathered from the posterior mean of improvements
Powered up improvements are also available.
This is just a taste. For further details/tutorials, see
The local approximate GP (laGP) idea has aspects in common with partition based schemes,
It is reminiscent of what Cressie (1991, pp. 131-134) called an “ad hoc” method of local kriging neighborhoods,
I think Cressie didn’t anticipate
All together, the idea is more modern than could have been anticipated in 1991 (and earlier).
It draws, in part, on recent findings
But a big divergence from previous approaches, particularly those from the spatial stats literature, lies in an emphasis on prediction
laGP is an example of transductive learning (Vapnik, 1995), as opposed to inductive learning, in that
For the next little bit, focus on prediction at a single testing or predictive location \(x\).
Lets think about the properties of GP predictive equations (an emulator in the computer experiments context, say) at \(x\).
So how about we search for the most useful data points (a sub-design relative to \(x\))
One option is a nearest neighbor (NN) subset:
The best modern reference for this idea is Emery (2009).
This is different from (and much simpler than) what other authors have recently dubbed the NNGP (Datta, et al., 2016), making things super confusing.
An animation of the greedy sub-design build-up would play here …
Our questions: Is it sensible? Is it good? Can we do better?
Yes!: with a greedy/forward stepwise scheme.
For a particular \(x\), solve a sequence of easy decision problems.
For \(j=n_0, \dots,n\):
Optimizing the criterion (1), and updating the GP (2), must not exceed \(O(j^2)\) so the total scheme remains in \(O(n^3)\).
… for sequential (sub-) design:
Given \(D_j(x)\) for particular \(x\), we search for \(x_{j+1}\) by
\[ \begin{aligned} J(x_{j+1}, x) &= \mathbb{E} \{ [Y(x) - \mu_{j+1}(x; \hat{\theta}_{j+1})]^2 \mid D_j(x) \} \\ &\approx V_j(x \mid x_{j+1}; \hat{\theta}_j) + \left(\frac{\partial \mu_j(x; \theta)}{\partial \theta} \Big{|}_{\theta = \hat{\theta}_j}\right)^2 /\, \mathcal{G}_{j+1}(\hat{\theta}_j). \end{aligned} \]
The approximation stems from
\[ J(x_{j+1}, x) \approx \bbox[lightblue,2pt]{V_j(x \mid x_{j+1}; \hat{\theta}_j)} + \left(\frac{\partial \mu_j(x; \theta)}{\partial \theta} \Big{|}_{\theta = \hat{\theta}_j}\right)^2 / \mathcal{G}_{j+1}(\hat{\theta}_j) \]
\[ \begin{aligned} \bbox[lightblue,2pt]{V_j(x \mid x_{j+1}; \theta)} &= \frac{\psi_j}{j-2} v_{j+1}(x; \theta), \\ \mbox{where } \;\;\; v_{j+1}(x; \theta) &= \left[ K_{j+1}(x, x) - k_{j+1}^\top(x) K_{j+1}^{-1} k_{j+1}(x) \right] \;\;\; \mbox{and } \;\;\; \psi_j = j \hat{\tau}^2_j. \end{aligned} \]
Integrating over \(x\) yields the ALC greedy scheme for approximate global A-optimal design (Seo, et al., 2000; Cohn, 1996),
\[ J(x_{j+1}, x) \approx V_j(x \mid x_{j+1}; \hat{\theta}_j) + \left(\bbox[lightgreen,2pt]{\frac{\partial \mu_j(x; \theta)}{\partial \theta}} \Big{|}_{\theta = \hat{\theta}_j}\right)^2 / \mathcal{G}_{j+1}(\hat{\theta}_j) \]
where \(\dot{k}_j(x)\) is the \(j\)-length column vector of derivatives of the correlation function \(K(x, x_k)\), for \(k=1,\dots,j\), with respect to \(\theta\).
\[ J(x_{j+1}, x) \approx V_j(x \mid x_{j+1}; \hat{\theta}_j) + \left(\frac{\partial \mu_j(x; \theta)}{\partial \theta} \Big{|}_{\theta = \hat{\theta}_j}\right)^2 / \bbox[lightpink,2pt]{\mathcal{G}_{j+1}(\hat{\theta}_j)} \]
\[ \begin{aligned} &\bbox[lightpink,2pt]{\mathcal{G}_{j+1}(\theta)} = F_j(\theta) + \mathbb{E} \left\{- \frac{\partial^2 \ell_j(y_{j+1}; \theta)}{\partial \theta^2} \Big{|} D_j; \theta \right\} \label{eq:G} \\ &\approx F_j(\theta) \!+\! \frac{1}{2 V_j(x_{j+1};\theta)^2} \!\times\! \left( \frac{\partial V_j(x_{j+1}; \theta)}{\partial \theta} \right)^2 \!\!+\! \frac{1}{V_j(x_{j+1}; \theta)} \left( \frac{\partial \mu_j(x_{j+1}; \theta)}{\partial \theta} \right)^2\!, \end{aligned} \]
where \(F_j(\theta) = - \ell''(Y_j; \theta)\), and with \(\dot{\mathcal{K}}_j = \dot{K}_j K_j^{-1}\) and \(\tilde{k}_j(x) = K_j^{-1} k_j(x)\),
\[ \begin{aligned} \frac{\partial V_j(x; \theta)}{\partial \theta} & = \frac{Y_j^\top K_j^{-1}\dot{\mathcal{K}}_j Y_j }{j-2} \left(K(x, x) -k_j^\top(x) \tilde{k}_j(x)\right) \\ &\;\;\; - \psi_j\left[ \dot{k}_j(x) \tilde{k}_j(x) + \tilde{k}_j(x)^\top (\dot{k}_j(x)-\dot{\mathcal{K}}_j k_j(x)) \right]. \end{aligned} \]
The MSPE criterion nests an ALC-like criterion, obtained by dropping the Fisher information term.
Reducing future variance is a sensible criterion in its own right.
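To make the greedy scheme concrete, here is a from-scratch sketch of mine using laGP’s lower-level API (newGP, alcGP, updateGP); the laGP function used below wraps a far more efficient implementation, so treat this as illustration only.

```r
## Greedy ALC sub-design for a single x (Xref), built by hand with
## laGP's low-level interface; laGP() does all this far more efficiently.
library(laGP)
greedy.alc <- function(Xref, n0, n, X, Y, d=0.1, g=1e-4)
{
  dists <- drop(distance(X, Xref))              ## squared distances to Xref
  close <- order(dists)[1:min(10*n, nrow(X))]   ## a manageable candidate set
  Xi <- close[1:n0]                             ## initialize with n0 NNs
  gpi <- newGP(X[Xi,,drop=FALSE], Y[Xi], d=d, g=g, dK=TRUE)
  for(j in (n0+1):n) {
    cand <- setdiff(close, Xi)
    alc <- alcGP(gpi, X[cand,,drop=FALSE], Xref)
    w <- which.max(alc)                         ## biggest variance reduction at Xref
    Xi <- c(Xi, cand[w])
    updateGP(gpi, X[cand[w],,drop=FALSE], Y[cand[w]])
  }
  p <- predGP(gpi, Xref, lite=TRUE)
  deleteGP(gpi)
  list(Xi=Xi, mean=p$mean, s2=p$s2)
}
```

For example, greedy.alc(matrix(0.5, nrow=1), 6, 20, X, Y) on 1-d training data X, Y returns the twenty chosen indices and a prediction at 0.5.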
Let’s consider a full-\(N\) sized design on a 2-d grid.
xg <- seq(-2, 2, by = 0.02)
X <- as.matrix(expand.grid(xg, xg))
print(N <- nrow(X))
## [1] 40401
Technically, the greedy sub-design method doesn’t require a response
But the laGP function wants one, because it is designed to make predictions
f2d <- function(x) {
g <- function(z) {
return(exp(-(z - 1)^2) + exp(-0.8 * (z + 1)^2) - 0.05 * sin(8 * (z + 0.1)))
}
return(-g(x[, 1]) * g(x[, 2]))
}
Y <- f2d(X)
Consider a prediction location \(x\), denoted by Xref in the code below.
library(laGP, quietly=TRUE, warn.conflicts=FALSE)
Xref <- matrix(c(-1.725, 1.725), nrow=1)
p.mspe <- laGP(Xref, 6, 50, X, Y, d=0.1, method="mspe")
p.alc <- laGP(Xref, 6, 50, X, Y, d=0.1, method="alc")
Both designs are plotted below.
plot(X[p.mspe$Xi,], xlab="x1", ylab="x2", type="n",
xlim=range(X[p.mspe$Xi,1]), ylim=range(X[p.mspe$Xi,2]))
text(X[p.mspe$Xi,], labels=1:length(p.mspe$Xi), cex=0.7)
text(X[p.alc$Xi,], labels=1:length(p.alc$Xi), cex=0.7, col=2)
points(Xref[1], Xref[2],pch =19, col=3)
legend("right", c("mspe", "alc"), text.col=c(1,2), bty="n")
Why do the criteria not prefer the closest possible points, i.e., the NNs?
Gramacy & Haaland (2016) explain that the form of the correlation has very little to do with it.
Consider the reduction in variance (an expression we have seen before):
\[ v_j(x; \theta) - v_{j+1}(x; \theta) = k_j^\top(x) G_j(x_{j+1}) v_j(x_{j+1}) k_j(x) + \cdots + K(x_{j+1},x)^2 / v_j(x_{j+1}) \]
\[ G_j(x') \equiv g_j(x') g_j^\top(x') \quad \mbox{where} \quad g_j(x') = - K_j^{-1} k_j(x')/v_j(x'). \]
So the criterion makes a trade-off:
Or in other words, the potential value of new design element \((x_{j+1}, y_{j+1})\) depends not just on its proximity to \(x\),
So we get a mixture of NNs and “satellite” points.
Why do the satellite points arrange themselves along “rays” emanating from \(x\)?
Gramacy & Haaland (2016) show some of the interesting patterns (ribbons and rings) that materialize in local designs depending on the kernel and parameterization.
The designs are qualitatively similar, and the predictions are nearly identical:
p <- rbind(c(p.mspe$mean, p.mspe$s2, p.mspe$df), c(p.alc$mean, p.alc$s2, p.alc$df))
colnames(p) <- c("mean", "s2", "df"); rownames(p) <- c("mspe", "alc")
p
## mean s2 df
## mspe -0.3724557 2.327819e-06 50
## alc -0.3724105 2.072180e-06 50
Although the designs are built using a fixed \(\theta = 0.1\), the predictive equations output at the end use local MLE calculations given the data \(D_n(x)\).
mle <- rbind(p.mspe$mle, p.alc$mle); rownames(mle) <- c("mspe", "alc")
mle
## d dits
## mspe 0.3512812 7
## alc 0.3395509 7
Finally, both local design methods are fast.
c(p.mspe$time, p.alc$time)
## elapsed elapsed
## 0.168 0.055
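For contrast, laGP also supports a pure nearest-neighbor sub-design via method="nn", which is faster still but typically comes with a somewhat wider predictive variance; the chunk below re-creates the setup from above so that it stands alone.

```r
## NN-based local design via method="nn", compared against ALC;
## grid data and Xref re-created from above so this chunk is standalone.
library(laGP)
xg <- seq(-2, 2, by=0.02)
X <- as.matrix(expand.grid(xg, xg))
f2d <- function(x)
{
  g <- function(z) exp(-(z-1)^2) + exp(-0.8*(z+1)^2) - 0.05*sin(8*(z+0.1))
  return(-g(x[,1]) * g(x[,2]))
}
Y <- f2d(X)
Xref <- matrix(c(-1.725, 1.725), nrow=1)
p.alc <- laGP(Xref, 6, 50, X, Y, d=0.1, method="alc")
p.nn <- laGP(Xref, 6, 50, X, Y, d=0.1, method="nn")
c(alc=p.alc$s2, nn=p.nn$s2)
```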
For a point of reference, inverting a \(4000 \times 4000\) matrix takes about five seconds on the same machine, using a multi-threaded BLAS/Lapack.
How do we extend this to predict on a big testing/predictive set?
One option is to serialize: a simple for loop over each \(x \in \mathcal{X}\).
But why serialize when you can parallelize?
In laGP’s C implementation, that’s as simple as a “parallel-for” OpenMP pragma.
#ifdef _OPENMP
#pragma omp parallel for private(i)
#endif
for(i = 0; i < npred; i++) { ...
To illustrate, consider the following \(\sim 10\)K-element predictive grid in \([-2,2]^2\),
xx <- seq(-1.97, 1.95, by=0.04)
XX <- as.matrix(expand.grid(xx, xx))
YY <- f2d(XX)
The aGP function iterates over the elements of \(\mathcal{X} =\) XX, performing an independent local design for each.
The omp.threads argument controls the number of OpenMP threads.
nth <- 8
P.alc <- aGP(X, Y, XX, omp.threads=nth, verb=0)
persp(xx, xx, -matrix(P.alc$mean, ncol = length(xx)), phi = 45,
theta = 45, xlab = "x1", ylab = "x2", zlab = "yhat(x)")
Although the input dimension is low,
For a closer look, consider a slice through the predictive surface at \(x_2 = 0.51\).
The code below sets up the slice and its plot.
med <- 0.51
zs <- XX[, 2] == med
sv <- sqrt(P.alc$var[zs])
r <- range(c(-P.alc$mean[zs] + 2*sv, -P.alc$mean[zs] - 2*sv))
plot(XX[zs,1], -P.alc$mean[zs], type="l", lwd=2, ylim=r, xlab="x1", ylab="y")
lines(XX[zs,1], -P.alc$mean[zs] + 2*sv, col=2, lty=2, lwd=2)
lines(XX[zs,1], -P.alc$mean[zs] - 2*sv, col=2, lty=2, lwd=2)
lines(XX[zs,1], -YY[zs], col=3, lwd=2, lty=3)
The error bars are very tight on the scale of the response,
What don’t we see?
Accuracy, however, is not uniform.
diff <- P.alc$mean - YY
Systematic bias in prediction, although extremely small.
plot(XX[zs,1], diff[zs], type="l", lwd=2, xlab = "x1", ylab = "y(x)-yhat(x)")
Considering the density of the input design, one could easily guess that
Although an approximation, the local nature of modeling means that, from a global perspective, the predictor is not bound by stationarity, lending aGP a degree of nonstationarity.
aGP goes beyond that by learning a separate \(\hat{\theta}_n(x)\) local to each \(x \in \mathcal{X}\).
In fact, the lengthscales vary spatially, and relatively smoothly.
plot(XX[zs,1], P.alc$mle$d[zs], type="l", lwd=2, xlab="x1", ylab="thetahat(x)")
df <- data.frame(y = log(P.alc$mle$d), XX)
lo <- loess(y ~ ., data=df, span=0.01)
lines(XX[zs,1], exp(lo$fitted)[zs], col=2, lty=2, lwd=2)
legend("topleft", "smoothed", col=2, lty=2, lwd=2, bty="n")
So even though the spatial field may be locally isotropic, globally it is not.
Nevertheless, even the extra degree of flexibility afforded by spatially varying \(\hat{\theta}_n(x)\) is not enough to entirely mitigate the small amount of bias we saw.
Several enhancements offer potential for improved performance.
Here, sub-design search is based on the smoothed lengthscales from the first stage.
P.alc2 <- aGP(X, Y, XX, d=exp(lo$fitted), omp.threads=nth, verb=0)
Now consider comparing the predictions from the first iteration to those from the second in terms of RMSE.
rmse <- data.frame(alc = sqrt(mean((P.alc$mean - YY)^2)),
alc2 = sqrt(mean((P.alc2$mean - YY)^2)))
rmse
## alc alc2
## 1 0.0006524959 0.0003222488
First, a harder example: back to the borehole data.
Check this out.
out1 <- aGP(x,y, xpred, d=list(max=20), omp.threads=nth, verb=0)
out2 <- aGP(x,y, xpred, d=list(start=out1$mle$d,max=20), omp.threads=nth, verb=0)
Much faster; much more accurate.
print(times <- c(times, aGP=as.numeric(out1$time), aGP2=as.numeric(out2$time)))
## sparse dense s999 aGP aGP2
## 3590.044 9732.370 281.875 9.324 8.893
s21 <- out1$var; s22 <- out2$var
print(scores <- c(scores, aGP=mean(-(out1$mean - ypred.0)^2/s21 - log(s21)),
aGP2=mean(-(out2$mean - ypred.0)^2/s22 - log(s22))))
## sparse dense s999 aGP aGP2
## -1.6162629 -2.6364460 -1.7966449 -0.5959322 -0.5614851
And we did all that with an isotropic correlation function; aGPsep offers a separable alternative.
outs <- aGPsep(x, y, xpred, d=list(max=20), omp.threads=nth, verb=0)
Similar compute times; quite a bit more accurate.
print(times <- c(times, aGPs=as.numeric(outs$time)))
## sparse dense s999 aGP aGP2 aGPs
## 3590.044 9732.370 281.875 9.324 8.893 5.013
s2 <- outs$var
print(scores <- c(scores, aGPs=mean(-(outs$mean - ypred.0)^2/s2 - log(s2))))
## sparse dense s999 aGP aGP2 aGPs
## -1.61626285 -2.63644596 -1.79664492 -0.59593218 -0.56148505 0.07414708
Surely something is lost with this local approach to GP approximation.
Kaufman et al. astutely observed that, especially when inducing sparsity in the covariance structure, it pays to shift some of the modeling burden onto a rich global mean function.
That's not easily mapped to the laGP setup, which works with a zero-mean process.
But the idea has merit, and we ought to be able to find an appropriate analog in the laGP world.
Instead, consider not a partition between trend and residual, but rather between global and local lengthscale structure.
Liu (2014) showed that the global (separable) lengthscale can be estimated consistently via (more manageably sized) random data subsets, e.g., block Latin hypercube subsamples.
We'll take this block LHS idea as inspiration for a two-stage scheme: estimate global lengthscales on a subset, then fit local GPs on inputs scaled by those estimates.
Consider a random 1000-sized subset.
n <- 1000
d2 <- darg(list(mle=TRUE, max=100), x)
subs <- sample(1:nrow(x), n, replace=FALSE)
gpsi <- newGPsep(x[subs, ], y[subs], rep(d2$start, ncol(x)), g=1/1000, dK=TRUE)
that <- mleGPsep(gpsi, param="d", tmin=d2$min, tmax=d2$max, ab=d2$ab, maxit=200)
psub <- predGPsep(gpsi, xpred, lite=TRUE)
deleteGPsep(gpsi)
Observe that subset GP prediction is pretty good on its own, besting all of our earlier scores.
s2 <- psub$s2
print(scores <- c(scores, sub=mean(-(psub$mean - ypred.0)^2/s2 - log(s2))))
## sparse dense s999 aGP aGP2 aGPs
## -1.61626285 -2.63644596 -1.79664492 -0.59593218 -0.56148505 0.07414708
## sub
## 0.70396672
The estimated lengthscales, stored in that, are super handy.
Don’t forget to scale both training and testing inputs.
scale <- sqrt(that$d)
xs <- x; xpreds <- xpred
for(j in 1:ncol(xs)) {
xs[,j] <- xs[,j] / scale[j]
xpreds[,j] <- xpreds[,j] / scale[j]
}
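To see why this scaling works, note that under a Gaussian kernel, dividing each input coordinate by \(\sqrt{\hat{\theta}_j}\) absorbs the global separable lengthscales into the coordinates themselves:
\[ \exp\left\{ -\sum_{j=1}^m \frac{(x_j - x_j')^2}{\hat{\theta}_j} \right\} = \exp\left\{ -\sum_{j=1}^m (\tilde{x}_j - \tilde{x}_j')^2 \right\}, \qquad \tilde{x}_j = \frac{x_j}{\sqrt{\hat{\theta}_j}}, \]
leaving the subsequent local fits to estimate (hopefully small) deviations around \(\theta = 1\).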
Now fit a local GP on the scaled inputs, achieving a multiresolution effect; note that \(\theta\) is initialized to 1.
out3 <- aGP(xs, y, xpreds, d=list(start=1, max=20), omp.threads=nth, verb=0)
s2 <- out3$var
print(scores <- c(scores, aGPsm=mean(-(out3$mean - ypred.0)^2/s2 - log(s2))))
## sparse dense s999 aGP aGP2 aGPs
## -1.61626285 -2.63644596 -1.79664492 -0.59593218 -0.56148505 0.07414708
## sub aGPsm
## 0.70396672 1.18405236
The default nugget value in laGP and aGP is too large for most deterministic computer experiments applications.
g <- 1/10000000
out4 <- aGP(xs, y, xpreds, d=list(start=1, max=20), g=g, omp.threads=nth, verb=0)
s2 <- out4$var
print(scores <- c(scores, aGPsm2=mean(-(out4$mean - ypred.0)^2/s2 - log(s2))))
## sparse dense s999 aGP aGP2 aGPs
## -1.61626285 -2.63644596 -1.79664492 -0.59593218 -0.56148505 0.07414708
## sub aGPsm aGPsm2
## 0.70396672 1.18405236 5.51571554
Holy smokes!
Let's revisit the HST drag data we introduced a while back.
Recall that the goal was to be able to predict the drag coefficient (response), globally,
There are eight inputs, including HST’s panel angle, and files with runs obtained on LHS designs, separately for each chemical species (O, O\(_2\), N, N\(_2\), He, H).
Read in the data …
hstHe <- read.table("lanl/HST/hstHe.dat", header=TRUE)
nrow(hstHe)
## [1] 1000000
… and (as usual) work with coded inputs.
m <- ncol(hstHe)-1
X <- hstHe[,1:m]
Y <- hstHe[,m+1]
maxX <- apply(X, 2, max)
minX <- apply(X, 2, min)
for(j in 1:ncol(X)) {
X[,j] <- X[,j] - minX[j]
X[,j] <- X[,j]/(maxX[j]-minX[j])
}
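That is, each column is mapped onto the unit interval:
\[ \tilde{x}_{ij} = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}} \in [0, 1]. \]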
Consider a 10-fold CV setup …
cv.folds <- function (n, folds = 10)
split(sample(1:n), rep(1:folds, length = n))
f <- cv.folds(nrow(X), 10)
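As a quick standalone sanity check of the fold generator (a sketch on a hypothetical \(n = 100\)): every index lands in exactly one fold, and fold sizes are balanced.

```r
## standalone sketch: re-define cv.folds and check its invariants
cv.folds <- function(n, folds = 10)
  split(sample(1:n), rep(1:folds, length = n))

f <- cv.folds(100, 10)
lengths(f)                ## ten folds of size 10
sort(unname(unlist(f)))   ## recovers 1:100, each index exactly once
```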
… but only “loop” through one fold here.
o <- f[[1]]
Xtest <- X[o,]; Xtrain <- X[-o,]
Ytest <- Y[o]; Ytrain <- Y[-o]
c(test=length(Ytest), train=length(Ytrain))
## test train
## 100000 900000
We’ll need it for our multiresolution approach later anyway, so let's start with a subset GP first.
da.orig <- darg(list(mle=TRUE), Xtrain, samp.size=10000)
sub <- sample(1:nrow(Xtrain), 1000, replace=FALSE)
gpsi <- newGPsep(Xtrain[sub,], Ytrain[sub], d=0.1, g=1/1000, dK=TRUE)
mle <- mleGPsep(gpsi, tmin=da.orig$min, tmax=10*da.orig$max, ab=da.orig$ab)
psub <- predGPsep(gpsi, Xtest, lite=TRUE)
deleteGPsep(gpsi)
rmspe <- c(sub=sqrt(mean((100*(psub$mean - Ytest)/Ytest)^2)))
rmspe
## sub
## 15.05449
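The metric here, matching the code above, is root mean squared percentage error (RMSPE):
\[ \mathrm{RMSPE} = \sqrt{\frac{1}{n'}\sum_{i=1}^{n'} \left(100 \cdot \frac{\hat{y}_i - y_i}{y_i}\right)^{2}}, \]
i.e., error on a relative (percentage) scale.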
How about a separable local GP?
alcsep <- aGPsep(Xtrain, Ytrain, Xtest, d=da.orig, omp.threads=nth, verb=0)
print(rmspe <- c(rmspe, alc=sqrt(mean((100*(alcsep$mean - Ytest)/Ytest)^2))))
## sub alc
## 15.05449 5.83785
Notice that a (\(10\times\)) smaller nugget doesn’t help here.
g <- 1/100000
alcsep2 <- aGPsep(Xtrain, Ytrain, Xtest, d=da.orig, g=g, omp.threads=nth, verb=0)
print(rmspe <- c(rmspe, alc2=sqrt(mean((100*(alcsep2$mean - Ytest)/Ytest)^2))))
## sub alc alc2
## 15.054490 5.837850 6.022708
First, pre-scale the inputs using the MLE lengthscales calculated on the subset above.
for(j in 1:ncol(Xtrain)) {
Xtrain[,j] <- Xtrain[,j] / sqrt(mle$d[j])
Xtest[,j] <- Xtest[,j] / sqrt(mle$d[j])
}
Construct a default prior appropriate for the scaled inputs.
da.s <- darg(list(mle=TRUE), Xtrain, samp.size=10000)
da.s$start <- 1; if(da.s$max < 2) da.s$max <- 2
Now the local fit on the scaled inputs. Woot!
alcsep.s <- aGPsep(Xtrain, Ytrain, Xtest, d=da.s, omp.threads=nth, verb=0)
print(rmspe <- c(rmspe, alcs=sqrt(mean((100*(alcsep.s$mean - Ytest)/Ytest)^2))))
## sub alc alc2 alcs
## 15.0544902 5.8378503 6.0227078 0.8210352
Recall the modularized KOH calibration apparatus, wherein the computer model emulator is needed only where the field data reside.
That is, we only need GP predictions at a relatively small set of locations: the \(x_j^F\), paired with promising settings of the calibration parameter \(u\).
Local GPs couldn't be more ideal for this setup.
One drawback, however, is that the discrete nature of independent local design searches for \(\hat{y}^M(x_j^F, u)\)
is going to ensure that our objective \[ p(u) \left[ \max_{\theta_b} p_b(\theta_b | D^{B}_{N_F}(u))\right] \] is not continuous in \(u\), ruling out gradient-based optimizers.
Gramacy et al. (2015) suggest a derivative-free approach: mesh adaptive direct search (MADS), as implemented by NOMAD. As MADS is a local solver, NOMAD requires initialization.
Gramacy et al. suggest choosing starting \(u\)-values from the best value(s) of the objective found on a small space-filling design.
The laGP package contains several functions that automate that objective, e.g., fcalib is like the calib function we implemented for the full GP case, and discrep.est is like our bhat function.
I'd love to show you the CRASH calibration, but it would take too long to run here.
Consider a computer model following the formula below, and its implementation in R.
\[ y^M(x, u) = \left(1-e^{-\frac{1}{2x_2}}\right) \frac{1000 u_1 x_1^3 + 1900 x_1^2+2092 x_1+60}{100 u_2 x_1^3 + 500 x_1^2 + 4x_1+20}. \]
M <- function(x,u)
{
x <- as.matrix(x)
u <- as.matrix(u)
out <- (1-exp(-1/(2*x[,2])))
out <- out * (1000*u[,1]*x[,1]^3+1900*x[,1]^2+2092*x[,1]+60)
out <- out / (100*u[,2]*x[,1]^3+500*x[,1]^2+4*x[,1]+20)
return(out)
}
The field data is generated as
\[ \begin{aligned} y^F(x) &= y^M(x, u^*) + b(x) + \varepsilon, \;\; \mbox{ where } \; b(x) = \frac{2(10x_1^2+4x_2^2)}{50 x_1 x_2+10} \\ \mbox{ and } \; \varepsilon &\stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, 0.5^2), \end{aligned} \]
using \(u^*=(0.2, 0.1)\).
In R:
bias <- function(x)
{
x <- as.matrix(x)
out <- 2*(10*x[,1]^2 + 4*x[,2]^2) / (50*x[,1]*x[,2] + 10)
return(out)
}
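As a quick standalone sanity check (a sketch re-defining bias, with an arbitrary illustrative input), verify the code agrees with the formula \(b(x) = 2(10x_1^2 + 4x_2^2)/(50 x_1 x_2 + 10)\):

```r
## standalone sketch: spot-check bias() against the formula at x = (0.5, 0.25)
bias <- function(x)
{
  x <- as.matrix(x)
  2*(10*x[,1]^2 + 4*x[,2]^2) / (50*x[,1]*x[,2] + 10)
}
bias(matrix(c(0.5, 0.25), nrow=1))   ## 5.5/16.25 = 0.3384615
```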
Consider \(N_F = 100\) field data runs comprised of two replicates of a 50-sized 2d LHS of \(x\)-values.
Zu holds the intermediate computer model evaluations at \(u^*\); randomLHS below comes from the lhs package.
library(lhs)
ny <- 50
X <- randomLHS(ny, 2)
u <- c(0.2, 0.1)
Zu <- M(X, matrix(u, nrow=1))
sd <- 0.5
reps <- 2
Y <- rep(Zu, reps) + rep(bias(X), reps) + rnorm(reps*length(Zu), sd=sd)
length(Y)
## [1] 100
Augment the field data with \(N_M = 10500\) computer model runs: a 4d LHS of size 10000, plus 10 copies of the 50 field-data \(x\)-locations paired with fresh random \(u\)-values.
nz <- 10000
XU <- randomLHS(nz, 4)
XU2 <- matrix(NA, nrow = 10*ny, ncol=4)
for(i in 1:10) {
I <- ((i-1)*ny + 1):(ny*i)
XU2[I, 1:2] <- X
}
XU2[ ,3:4] <- randomLHS(10*ny, 2)
XU <- rbind(XU, XU2)
Z <- M(XU[,1:2], XU[,3:4])
length(Z)
## [1] 10500
Z contains our \(y^M\) values.
The following block of code sets default priors and specifies details of the model(s) to be estimated.
bias.est <- TRUE ## change to FALSE for unbiased version
methods <- rep("alc", 2) ## two passes of laGP design/MLE
da <- d <- darg(NULL, XU)
g <- garg(list(mle = TRUE), Y)
The prior is completed with a (log) prior density on the calibration parameter, \(u\), chosen to discourage settings on the “edges” of the space.
beta.prior <- function(u, a=2, b=2, log=TRUE)
{
if(length(a) == 1) a <- rep(a, length(u))
else if(length(a) != length(u)) stop("length(a) must be 1 or length(u)")
if(length(b) == 1) b <- rep(b, length(u))
else if(length(b) != length(u)) stop("length(b) must be 1 or length(u)")
if(log) return(sum(dbeta(u, a, b, log = TRUE)))
else return(prod(dbeta(u, a, b, log = FALSE)))
}
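With the default \(a = b = 2\) this is an independent \(\mathrm{Beta}(2,2)\) prior in each coordinate:
\[ p(u) = \prod_{j} \frac{\Gamma(4)}{\Gamma(2)\Gamma(2)}\, u_j(1 - u_j) = \prod_j 6\, u_j (1 - u_j), \]
whose density vanishes at the edges of \([0,1]\), discouraging boundary solutions.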
Now we are ready to evaluate the objective on a “grid” to search for a starting value for NOMAD.
Here is the “grid”, via maximin LHS away from the edges.
initsize <- 10*ncol(X)
uinit <- maximinLHS(initsize, 2)
uinit <- 0.9*uinit + 0.05
Here are the objective evaluations on that “grid”.
llinit <- rep(NA, nrow(uinit))
for(i in 1:nrow(uinit)) {
llinit[i] <- fcalib(uinit[i,], XU, Z, X, Y, da, d, g, beta.prior,
methods, NULL, bias.est, nth, verb=0)
}
An R interface to NOMAD is provided by snomadr in the crs package.
Here are NOMAD options that work well for laGP-based calibration examples.
library(crs)
imesh <- 0.1
opts <- list("MAX_BB_EVAL"=1000, "INITIAL_MESH_SIZE"=imesh,
"MIN_POLL_SIZE"="r0.001", "DISPLAY_DEGREE"=0)
The code below invokes snomadr on the best input(s) found on the "grid", repeating until a minimum number of total NOMAD iterations has been reached.
Usually one pass is sufficient to meet the iteration threshold.
its <- 0; i <- 1; out <- NULL
o <- order(llinit)
while(its < 10) {
outi <- snomadr(fcalib, 2, c(0,0), 0, x0=uinit[o[i],], lb=c(0,0), ub=c(1,1),
opts=opts, XU=XU, Z=Z, X=X, Y=Y, da=da, d=d,g=g, methods=methods, M=NULL, verb=0,
bias=bias.est, omp.threads=nth, uprior=beta.prior, save.global=.GlobalEnv)
its <- its + outi$iterations
if(is.null(out) || outi$objective < out$objective) out <- outi
i <- i + 1
}
Then, extract information for visualizing/interpolating a posterior surface over \(u\).
Xp <- rbind(uinit, as.matrix(fcalib.save[,1:2]))
Zp <- c(-llinit, fcalib.save[,3])
wi <- which(!is.finite(Zp))
if(length(wi) > 0) { Xp <- Xp[-wi,]; Zp <- Zp[-wi] }
library(akima)
surf <- interp(Xp[,1], Xp[,2], Zp, duplicate = "mean")
u.hat <- out$solution
image(surf, xlab="u1", ylab="u2", col=heat.colors(128), xlim=c(0,1), ylim=c(0,1))
points(uinit); points(fcalib.save[,1:2], col = 3, pch = 18)
points(u.hat[1], u.hat[2], col = 4, pch = 18)
abline(v=u[1], lty = 2); abline(h=u[2], lty = 2)
Observe that the true \(u^*\) value is far from the \(\hat{u}\) that we found.
Since there are far fewer evaluations made near \(u^*\), the interpolated surface is less reliable there; a fairer comparison evaluates the objective directly at \(u^*\).
Xu <- cbind(X, matrix(rep(u, ny), ncol=2, byrow=TRUE))
Mhat.u <- aGP.seq(XU, Z, Xu, da, methods, ncalib=2, omp.threads=nth, verb=0)
cmle.u <- discrep.est(X, Y, Mhat.u$mean, d, g, bias.est, FALSE)
cmle.u$ll <- cmle.u$ll + beta.prior(u)
c(u.hat= -out$objective, u=cmle.u$ll)
## u.hat u
## -124.5915 -129.3812
Let's see which (\(\hat{u}\) or \(u^*\)) leads to better prediction out-of-sample.
nny <- 1000
XX <- randomLHS(nny, 2)
ZZu <- M(XX, matrix(u, nrow=1))
YYtrue <- ZZu + bias(XX)
First, prediction with the true \(u^*\).
XXu <- cbind(XX, matrix(rep(u, nny), ncol=2, byrow=TRUE))
Mhat.oos.u <- aGP.seq(XU, Z, XXu, da, methods, ncalib=2, omp.threads=nth, verb=0)
YYm.pred.u <- predGP(cmle.u$gp, XX)
YY.pred.u <- YYm.pred.u$mean + Mhat.oos.u$mean
rmse.u <- sqrt(mean((YY.pred.u - YYtrue)^2))
deleteGP(cmle.u$gp)
For the estimated \(\hat{u}\) we need to backtrack through what we did earlier, re-estimating the discrepancy at \(\hat{u}\).
Xu <- cbind(X, matrix(rep(u.hat, ny), ncol=2, byrow=TRUE))
Mhat <- aGP.seq(XU, Z, Xu, da, methods, ncalib=2, omp.threads=nth, verb=0)
cmle <- discrep.est(X, Y, Mhat$mean, d, g, bias.est, FALSE)
cmle$ll <- cmle$ll + beta.prior(u.hat)
Here is a sanity check that this gives the same objective evaluation as what came out of snomadr.
print(c(cmle$ll, -out$objective))
## [1] -124.5915 -124.5915
Now we can repeat what we did with the true \(u^*\) value with our estimated one \(\hat{u}\).
XXu <- cbind(XX, matrix(rep(u.hat, nny), ncol = 2, byrow = TRUE))
Mhat.oos <- aGP.seq(XU, Z, XXu, da, methods, ncalib=2, omp.threads=nth, verb=0)
YYm.pred <- predGP(cmle$gp, XX)
YY.pred <- YYm.pred$mean + Mhat.oos$mean
rmse <- sqrt(mean((YY.pred - YYtrue)^2))
How do our RMSEs compare?
c(u.hat=rmse, u=rmse.u)
## u.hat u
## 0.1337277 0.1443372